Foreword


The substance of this report has been developed through consultation with 
staff members from the offices of Research and Development and Classifications 
Operations of the Patent Office, and from the Data Processing Systems Division 
of the National Bureau of Standards. For the helpful criticism of many
contributors, then, the author is grateful. Particular appreciation is due the Staff Director
for his advice and sympathetic encouragement. 

The responsibility for the organization and presentation of these controversial 
materials, however, is the author’s alone. Accordingly, he welcomes illuminating
comment from any source in this or related fields of interest.


Simon M. Newman 
LINGUISTIC PROBLEMS IN MECHANIZATION OF PATENT SEARCHING


INTRODUCTION* 


Linguists have taken an interest in the mechani- 
zation of Patent Office searching. It is clear that 
we will need their talents and their methodology if 
mechanization is to be achieved. In order that we 
may utilize their skills efficiently, our peculiar 
linguistics must be projected for analysis. It also 
seems advisable to collect in one place the many 
references which have been made to our language 
problems and to supplement these references with 
additional material which may guide the interested 
reader.2


CODING TERMINOLOGY 


Unambiguous Language 


The necessity for stating unambiguously the 
“notions” or phenomena present in patent dis- 
closures has been reviewed, and one proposed sys- 
tem has been described in some detail.3 

Any system for the mechanization of searching 
will require the formulation of unambiguous terms. 
Each term will serve either as a code itself or 
as a designation of an unambiguous code. Coding 
is thus the process of creating specific unambiguous 
terms as substitutes for “notions.” Encoding is 
the general process of preparing a document for 
storage by the application of these code terms.
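
The distinction between coding and encoding can be pictured with a small
sketch. It is illustrative only: the code book, its code designations, and the
function name are hypothetical and are not drawn from any proposed Patent
Office system. Coding is represented by the table that assigns one code to
each notion (synonymous words share a code); encoding is the application of
that table to the text of a document.

```python
import re

# Hypothetical code book: each word maps to exactly one code, and
# words treated here as synonymous share the same code.
CODEBOOK = {
    "rivet": "JOIN-01",
    "riveting": "JOIN-01",
    "weld": "JOIN-02",
    "welding": "JOIN-02",
    "girder": "BEAM-01",
    "beam": "BEAM-01",
}


def encode(text):
    """Return the sorted set of codes for the coded words found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({CODEBOOK[word] for word in words if word in CODEBOOK})


codes = encode("A girder joined to a beam by welding")
```

Words absent from the code book are simply ignored in this sketch; a working
system would presumably have to report them so that new terms could be coded.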


Drawings and Illustrations


A patent disclosure always includes a verbal 
specification, followed by a series of claims which
define the limits of the invention. Where the nature 
of the invention permits, the disclosure must in- 
clude a drawing to which the verbal specification 
refers. The majority of patents include drawings. 
Usually they illustrate the structure of the device 
being claimed; however, they may include or may 
be a graph, a flow sheet, or a circuit diagram. 

If no drawing is included, a patent searcher 
must read the specification, and his search, though 
much more detailed, is similar to any other litera- 
ture search. However, when the patent includes a 
drawing, a search, whether generic or specific, 
almost invariably involves a visual examination 
of the drawing, rather than an examination of the 
printed specification. In such a search, spot 
reading of the specification is usually restricted 
to resolving ambiguities raised in the searcher’s 
mind about some detail of the drawing, or to determining
the uses to which the disclosed structure may be put.


The use of drawings rather than the verbal 
specifications creates one of the basic language 
problems in the mechanization of searching. The 
adage that “a picture is worth ten thousand words” 
particularly applies to the use of patent drawings 
in manual searching. Complex shapes, interre- 
lations of the functions of their constituent ele- 
ments, and details of their topological orientation 
often can be comprehended at a single glance. In- 
formation from graphs can usually be assimilated 
faster than that from the equations which generate 
them. Circuitry, both electrical and hydraulic, 
can be followed quickly and accurately when shown 
in line drawings. Flow sheets—though technically 
not drawings—serve to abstract the essence of 
complex processes and hence often can eliminate a
tedious and unnecessary study of a specification. 

Some drawings have details of shapes and their 
interrelations which are clear and unambiguous, 
even though they are not described in the text of 
the specification. These illustrated details are as 
valid for Patent Office search purposes as a verbal
text describing them. Any coding system must 
provide for the formulation of terminology em- 
bracing all such details. Although it may be pos- 
sible to store drawings, as such, in a machine 
memory, any search request might be made either 
in linguistic terms or in the form of illustrations. 
Therefore any disclosure so stored also must be 
encoded in linguistic terms. 


Static Structures 


Static structures can be encoded solely in terms 
of the size, shape, and topological (orientational) 
relationship of their parts. The terminology drawn
from Geometry and Trigonometry will serve to 
code such factors. Possibly other terms, such as 
fillet, joint, lamina, can be defined unambiguously. 
Most names now commonly employed to describe 
objects, however, are not helpful in uniquely de- 
scribing their structure, since they are usually 
either functional or descriptive of some incidental 
property of the structure. 

One complication in structural encoding occurs 
if two or more well-recognized organizations of 
parts have one element in common. The occurrence
of this one part in two separate organizations
requires some encoding principle which will allow 
retrieval of this part in either organization or as a
common part of both. 

The problem of coding interrelations has been exhaustively
analyzed, and one solution has been
proposed.3
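
The retrieval requirement for a part shared by two organizations can be
sketched as follows. This is a minimal illustration under assumed names, not
the solution proposed in the cited reference: the class, its methods, and the
sample disclosure (a shaft common to a gear train and a clutch) are all
hypothetical. Each part is indexed under every organization containing it, so
that it can be retrieved through either organization or as the element common
to both.

```python
from collections import defaultdict


class StructureIndex:
    """Toy index relating parts to the organizations that contain them."""

    def __init__(self):
        self._orgs_by_part = defaultdict(set)
        self._parts_by_org = defaultdict(set)

    def add(self, organization, parts):
        """Record that an organization is made up of the given parts."""
        for part in parts:
            self._orgs_by_part[part].add(organization)
            self._parts_by_org[organization].add(part)

    def parts_of(self, organization):
        """Retrieve the parts of one organization."""
        return set(self._parts_by_org.get(organization, set()))

    def organizations_of(self, part):
        """Retrieve every organization in which a part occurs."""
        return set(self._orgs_by_part.get(part, set()))

    def common_parts(self, org_a, org_b):
        """Retrieve the parts shared by two organizations."""
        return self.parts_of(org_a) & self.parts_of(org_b)


index = StructureIndex()
# Hypothetical disclosure: one shaft serves both a gear train and a clutch.
index.add("gear_train", ["shaft", "pinion", "spur_gear"])
index.add("clutch", ["shaft", "friction_plate", "spring"])
shared = index.common_parts("gear_train", "clutch")
```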
Functions


As implied above, a new terminology of functions 
(uses) of structures is also needed. These terms 
must be directed not to the disclosed accidental
use, but to some expression of basic use defining 
what has been called the necessary or proximate 
function.5 Terminology derived from proximate 
function can best be illustrated by one of the very 
few situations in which such concepts already have 
been defined. For example, in an analysis of a 
series of patents directed to methods of shaping 
devices from metal pieces, it was determined that 
there were only three basic mutually exclusive 
methods: 


(1) Assembly.—The addition of some extraneous 
material to a single unitary structure; e.g., riveting 
or welding two girders together. 

(2) Parting.—The removal of some material from 
a single unitary structure; e.g., cutting, punching, 
drilling, turning, etching or sawing. 

(3) Reshaping.—The change of physical dimen- 
sions of one unit without assembly or parting; e.g., 
rolling, forging, coining or bending. 


Shaping of this sort comprises one form of manufacturing.6
By the addition of two other mutually
exclusive methods, it would appear that the entire 
field of present manufacturing can be encompassed. 
These two groups are: 


(4) Quantum fluctuation.—The so-called “changes 
of state” of matter among gases, liquids, and sol- 
ids; e.g., melting, condensing or sublimation. 

(5) Generation.—The creation of new things by 
atomic or sub-atomic recombination; e.g., chemi- 
cal reactions resulting in precipitation, or atomic 
fission and fusion. 


In such a broadened set of categories, one would 
also include in assembly, the filling of a mattress 
with felted cotton; in parting, the tearing of the end 
of a cigarette package; and in reshaping, the mold- 
ing of clay. Other manufacturing processes, of 
course, may include combinations of these classes. 
For example, the baling of hay is both assembly and
reshaping. 

These five classes of manufacturing do not, of 
course, exhaust function terminology. Many other 
processes, including measuring, testing, transport- 
ing, transmitting of electrical energy, modifying 
of conditions of pressure and temperature, and pro- 
jection of optical images must be analyzed like- 
wise. 
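
The five manufacturing classes, and the fact that one process may combine
several of them, can be expressed as combinable flags. The sketch below is an
assumption-laden illustration: the enumeration, the table of processes, and
the function name are hypothetical, and the class called "quantum
fluctuation" in the text is labeled CHANGE_OF_STATE here for readability.

```python
from enum import Flag, auto


class Manufacturing(Flag):
    ASSEMBLY = auto()
    PARTING = auto()
    RESHAPING = auto()
    CHANGE_OF_STATE = auto()  # the text's "quantum fluctuation"
    GENERATION = auto()


# Hypothetical assignments, following the examples given in the text.
PROCESS_CODES = {
    "riveting": Manufacturing.ASSEMBLY,
    "welding": Manufacturing.ASSEMBLY,
    "drilling": Manufacturing.PARTING,
    "sawing": Manufacturing.PARTING,
    "forging": Manufacturing.RESHAPING,
    "rolling": Manufacturing.RESHAPING,
    "melting": Manufacturing.CHANGE_OF_STATE,
    "precipitation": Manufacturing.GENERATION,
    # A combined process carries both of its classes.
    "baling hay": Manufacturing.ASSEMBLY | Manufacturing.RESHAPING,
}


def processes_involving(category):
    """Return, in alphabetical order, the processes that include a class."""
    return sorted(
        name for name, code in PROCESS_CODES.items() if category & code
    )


reshaping = processes_involving(Manufacturing.RESHAPING)
```

A search on either assembly or reshaping retrieves the baling of hay, which is
the behavior the combined classes call for.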


Apparatus 


Apparatus illustrated in patent drawings include 
(1) those consisting solely of static structural 
parts, (2) those which include one or more parts 
which may be removed and reassembled from 
another part, (3) those including one or more inci- 
dentally movable but independently operable parts, 
(4) those which include one or more series of interconnected
and usually intercontrolled movable
parts, and (5) combinations of one or more of these 
four groups. 


An example of a purely static structural appa- 
ratus of group (1) is the conventional core-type 
automotive radiator. This same radiator may have 
as an adjunct a cap that may be removed for filling 
the core and then be reassembled, thus exemplify- 
ing a combination of groups (1) and (2). This radi- 
ator might also disclose a simple plug valve at its 
lower portion, movable to draining position for 
emptying the core. This disclosure then would 
illustrate a combination of groups (1) and (3). A 
disclosure of the body portion of a fountain pen 
would include interrelated elements which when operated
constitute the filling mechanism. This is an
example of groups (1) and (4) in combination. 

The coding of static structures has already been 
considered. But the creation of additional termi- 
nology for similar encoding of both the independ- 
ently removable and the interrelated moving parts 
with each other and with static structures will be 
necessary. In the solution to this part of our prob- 
lem, it is likely that some coding principle can be 
evolved which is analogous to proximate function, 
the principle previously suggested as governing 
one choice of manufacturing process terms. How- 
ever, the proximate function principle itself defi- 
nitely is not applicable to apparatus, for although 
the processes of forging and rolling are closely 
similar, a forging press and a rolling stand are 
entirely different apparatus; they are correctly de- 
scribed only in terms of the organization of their 
parts, and not by what function these parts may ac- 
cidentally perform. 


For some time it has been clear that both the 
nineteenth century basis of classification—that of 
material worked on,7 and the later used basis of an
accidental function of the apparatus,8 have proven
ineffective in segregating into classes those dis- 
closures which are pertinent to normal search re- 
quests. It has been suggested recently that one 
basis for categorizing manufacturing apparatus is 
the relative movement of tool (or its holder) with 
the work (or its holder). This approach is now being
utilized in a reclassification project involving cutting
machines.9 Some of the proposed first line
(unindented) titles10 of subclasses, arranged in the
order of decreasing complexity, are found in Figure 
1. It seems clear that apparatus having tools other 
than cutting tools can be similarly categorized, and 
that machines falling into such categories have
similar characteristics. It is possible that this
approach will offer a key to the solution of one part 
of this problem. 


All apparatus, of course, does not relate to manu- 
facture, nor does all manufacturing apparatus 
utilize tools. Terminology is also needed for a
host of other devices, such as computing, projec- 
tion, and transport apparatus. 
Tool engaged work during dwell of intermittent work feed

Cutting motion of tool has component in direction of moving work

Transverse cutter with motion normal to running length work

Interrelated tool feed and work guide moving means

Interrelated tool feed means and means to actuate work immobilizer

Interrelated tool and work feeding means


CUTTING APPARATUS CATEGORIES
Figure 1


It has been postulated that apparatus terminology 
should not be derived from either the process per- 
formed or the material worked upon. In the 
process of encoding, however, this conclusion does 
not preclude the use of additional descriptors of 
process performed or material worked upon, pro- 
vided that these descriptors are drawn from the 
unambiguous terms previously created to describe 
the material or the process. 


Chemical Compounds 


In the field of chemical compounds several sets
of terms and rules for their application are in use.
Although agreement on these rules has not been 
achieved, there is substantial agreement about the 
basic source from which terminology is drawn. 
The resolution of an unambiguous terminology for 
chemical compounds rests primarily on standardi- 
zation. However, some or all of the definition 
problems previously discussed are present in 
chemical structures. For example, in complex 
organic compounds we find that one or more 
elements may be common to two or more parts of 
a compound, e.g., the common carbon atoms in 
two fused rings. Occasionally one or more atoms 
may resonate between two separate points of attachment.
Solutions to some of these problems in
one field may well result in their resolution in the 
other field. 
In addition, there are many instances in which it
is desirable to designate a class for which there
is no existing term. The patent profession has
resorted to a logical artifice in postulating this
class as a restricted form of a more inclusive
class. This artifice is known as a Markush11 expression.
For example, it may be necessary to
refer to a class which includes less than all known 
acids, e.g.: 


An acid selected from the group consisting of 
carboxylic acids and sulfonic acids. 


In this example the two specified types are not in- 
tended as alternative species of the more inclusive 
genus acid. Rather, the complex of all the
characteristics common to these two species constitutes
the characteristics of that class for which
no single term exists. 

A more complex Markush expression designates 
the desired class in terms of a structural formula, 
with artificial class designations of some or all of 
its constituents, e.g.:


A compound of the formula [structural formula bearing substituents X, Y, and Z; not reproduced] wherein


X is a member selected from the group consisting
of -OH, -Cl, and -CH3; Y is a member selected
from the group consisting of -H, -CH3, and
-CH2-CH3; and Z is a member selected from the
group consisting of -SH, -SO3H, and -CNS.


If this were the form of a retrieval question it 
should be answered by any one of the 27 members of
the class stated in the question. In the develop- 
ment of an encoding scheme for recording such 
artificial genera, both the inclusiveness of the de- 
fined class and the identity of the members con- 
stituting that class must be preserved. 
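
The count of 27 members follows from three independent choices at each of
three positions (3 x 3 x 3). The sketch below shows one way of preserving both
the inclusiveness of the class and the identity of its members. It is an
illustration only: the function names are hypothetical, the structural skeleton
is omitted, and the subscripts of the substituents, which are not fully legible
in the source, are read here as -CH3, -CH2-CH3, and -SO3H.

```python
from itertools import product

# Alternatives at each position of the Markush expression.
# Assumption: subscripts are read as -CH3, -CH2-CH3, and -SO3H.
MARKUSH_GROUPS = {
    "X": ["-OH", "-Cl", "-CH3"],
    "Y": ["-H", "-CH3", "-CH2-CH3"],
    "Z": ["-SH", "-SO3H", "-CNS"],
}


def expand(groups):
    """List every member of the class as a position-to-substituent mapping."""
    positions = sorted(groups)
    choices = product(*(groups[position] for position in positions))
    return [dict(zip(positions, choice)) for choice in choices]


def matches(groups, compound):
    """Test membership without enumerating the whole class."""
    return all(
        compound.get(position) in allowed
        for position, allowed in groups.items()
    )


members = expand(MARKUSH_GROUPS)
```

Storing the groups themselves records the class compactly, while expand
recovers the individual members when a specific answer is wanted.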


Other Word Classes 


In addition to those general classes of words 
which have been discussed, there are many other 
classes which occur in expository prose which 
must be considered in creating a comprehensive 
encoding procedure. There are qualifiers and 
quantifiers, both of which have been elsewhere 
considered.12

At present such entities as size, time, mass, 
and temperature, are each measured by diverse 
standards. For each form of measurement there 
must be concurrence in a single, unambiguous 
standard. Distances of .01 millimeter and of 10,000 
light years (1 light year = 6 x 10^12 miles), for example,
may both occur in a single disclosure relating
to astronomy. Color is sometimes described
by hue and sometimes by wavelength. Ratios of 
numbers and ranges of values occur frequently in 
disclosures. Other classes of terms (many with 
specialized problems) undoubtedly will be noted by 
specialists in other fields. 
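
Concurrence in a single standard of measurement can be sketched as a
conversion of every stated distance to one base unit before storage or
comparison. The choice of the meter as the base unit, the table, and the
function names are assumptions made for illustration. The conversion factors
are the defined values of the international mile and of the light year (the
speed of light multiplied by one Julian year); they give about 5.9 x 10^12
miles to the light year, which the text rounds to 6 x 10^12.

```python
# Conversion factors to the assumed base unit, the meter.
METERS_PER_UNIT = {
    "millimeter": 1e-3,
    "meter": 1.0,
    "mile": 1609.344,  # international mile
    "light_year": 299_792_458 * 31_557_600,  # speed of light x Julian year
}


def to_meters(value, unit):
    """Express a distance in meters, whatever unit the disclosure used."""
    return value * METERS_PER_UNIT[unit]


def in_range(value, unit, low_m, high_m):
    """Test whether a distance falls in a range stated in meters."""
    return low_m <= to_meters(value, unit) <= high_m


small = to_meters(0.01, "millimeter")
large = to_meters(10_000, "light_year")
miles_per_light_year = to_meters(1, "light_year") / to_meters(1, "mile")
```

Once both distances are held in the same unit, ranges and ratios of values can
be compared directly, however disparate the original expressions.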


LINGUISTIC ASPECTS OF ENCODING 


Categorization for Generic Searching


Any notion or phenomenon may be described by 
one or more facets of its occurrence. The allegory 
of the blind men and the elephant illustrates this 
concept. A particular chemical compound may be 
correctly described by the term “alkane” or by the
expression “saturated hydrocarbon.” A briefcase
is a “leather article,” a “piece of luggage” and a 
“compartmented receptacle.” Any chosen term 
may be subsumed under myriad classes. A few of 
the diverse classes under which the term “pencil” 
might be subsumed and some of the other members 
of these classes, for instance, are listed in Fig. 2. 
                        PENCIL

OTHER CLASS                    OTHER CLASS MEMBERS

Writing equipment              Typewriter
                               Fountain pen
Wood containing articles       Desk
                               Hammer handle
Portable implements            Hoe
                               Eyeglasses
Artist’s equipment             Sketch pad
                               Palette
Coating applicators            Brush
                               Salt-shaker
Envelope openers               Knife
                               Scissors
Indicators (i.e., pointers)    Flash light
                               Finger
Pocket tearers                 Silver dollar
                               Key


Figure 2


The effective use of coined unambiguous terms 
in an information retrieval process requires that 
they be categorized in a plurality of overlapping 
hierarchies, both for the purpose of encoding a dis- 
closure specifically and generically, and of allowing 
flexibility in formulating a retrieval request in the 
same manner. When one has a pencil before him, 
it is not difficult to perceive many of the generic 
classes to which a pencil may belong. Nor is it 
more difficult, given a series of such broad generic 
terms, to determine other members in each of the 
series. 

At the present time an examiner may formulate a
retrieval request from a claimed disclosure of an 
article in one of these “other” classes. If the claim 
is drawn in terms broad enough to recite generical- 
ly the characteristics of the “other” class, ingenu- 
ity and imagination are required to discover other 
subsumed terms which are materially different 
from the term for the disclosed object. For example,
if the examiner has a disclosure of a salt-shaker
before him, and the claim is drawn in terms
of a coating applicator, he of course searches salt- 
shakers. If he cannot find a salt-shaker which an- 
ticipates the claim, he usually thinks of talc dis- 
pensers, clothes sprinklers, track sanders and the 
like, but seldom of pencils. For some strange 
psychological reason, the presence of the originally 
disclosed object, the salt-shaker, throws a mental
shadow which tends to hide other objects defined 
by the claim if they are unlike the disclosed object 
in appearance or operation. This situation is faced
almost every time an inquiry is framed for manual
searching. A categorical scheme suited to the
logic of a mechanized search will eliminate such
psychological interference.13
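
The overlapping hierarchies of Figure 2 and the salt-shaker example can be
sketched as a table of classes, each listing its members. The table layout and
the function names are hypothetical, and a working scheme would require far
more classes than the eight shown. A generic search made in terms of the class
returns every member alike, so that the pencil is retrieved as readily as the
brush.

```python
# Classes from Figure 2; "pencil" is a member of every one of them.
CLASS_MEMBERS = {
    "writing equipment": {"pencil", "typewriter", "fountain pen"},
    "wood containing articles": {"pencil", "desk", "hammer handle"},
    "portable implements": {"pencil", "hoe", "eyeglasses"},
    "artist's equipment": {"pencil", "sketch pad", "palette"},
    "coating applicators": {"pencil", "brush", "salt-shaker"},
    "envelope openers": {"pencil", "knife", "scissors"},
    "indicators": {"pencil", "flash light", "finger"},
    "pocket tearers": {"pencil", "silver dollar", "key"},
}


def classes_of(term):
    """Return every class under which a term is subsumed."""
    return sorted(
        name for name, members in CLASS_MEMBERS.items() if term in members
    )


def generic_search(class_name, exclude=()):
    """Return all members of a class, less any already searched."""
    return sorted(CLASS_MEMBERS[class_name] - set(exclude))


# Claim drawn to a coating applicator; salt-shakers already searched.
candidates = generic_search("coating applicators", exclude=["salt-shaker"])
```

Because the machine consults the class table rather than the appearance of the
disclosed object, the search is not shadowed by the salt-shaker.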


Ambiguities 


The resolution of ambiguities either in single 
terms or whole phrases or sentences may require 
evaluation of the context, either in the same sen- 
tence, in other sentences, or even in the drawings. 
For example, in an analysis of qualifying language 
used in a U.S. patent, the following expression was
found which referred to a joint between two sepa- 
rate beads: “The resilience of the material per- 
mits the head on one bead being forced through the 
mouth into the socket of another bead.” Interpreted 
in context, this statement was found to mean that 
“there must be enough resiliency to yield without 
breaking or tearing ... and ... without allowing
the joint to separate ....”14

Homonyms constitute a specific form of ambigu- 
ity, and are numerous in the expository prose of 
patent usage, since the jargon of specialized fields 
often appropriates terms from other fields. In appropriation,
they may be given either a narrower
meaning, e.g., force (in physics); a broader mean- 
ing, e.g., light (when used to include infrared radi- 
ation); a meaning suggested by their shape or func- 
tion, e.g., a coat (of paint); or a meaning which is 
purely arbitrary, e.g., a frog (of a railroad track).

Two words may be synonyms in one context, but 
not in another. For example, deep and dark when 
used to modify blue might be considered virtually 
synonymous; but when used to modify chamber, they 
definitely have distinct meanings. In describing new 
techniques, scientific writers undoubtedly will use 
the existing vocabulary in other special senses and 
will coin new terms which would not appear in any 
existing lexicons. 


Implied Conceptual Facets of Terms 


An encoding problem closely related to the cate- 
gorization of broader class terms is raised by the 
implied conceptual facets of a single term. The 
designation of the material from which an article 
is made implies all the known properties of that ma- 
terial. The disclosure of an electrically-actuated 
device implies a source of current to operate it. 
Similarly, a term for a disease may imply (1) the
cause, (2) the part or parts of the body affected, (3)
the symptoms of and/or tests for its presence, (4) 
the drug or medication used, and its method of and 
apparatus for administration, (5) other methods of 
treatment, and possibly (6) the medical uses to which 
the disease may be put; i.e., a patient may be pur- 
posely infected with the disease as inoculation or 
as a treatment for a different body malfunctioning. 


Before patents can be encoded, all the pertinent 
implied concepts of each explicit term will have to 
be coded. The choice of the pertinent from the 
plethora which may be conjured up is the perplex- 
ing task of the encoder. 
MECHANIZED ENCODING


Linguistic problems once solved, practicable 
coding logics formulated, and the searching process 
mechanized, there remains the staggering task of 
encoding the more than three million United States 
patents, the five or more million foreign patents, 
and other disclosures now in the files. The United 
States Patent Office alone is issuing new patents 
at the rate of 50,000 a year. It has accordingly 
been proposed that plans be made to encode by 
mechanical means. Some progress has already 
been made in mechanized pattern recognition. It 
may safely be assumed that within a few years a 
practical print reader will be available which will 
recognize the printed word. Some aspects of re- 
search in mechanized translation of language give 
promise that eventually the meaning of a sentence, 
in context, will be mechanically extracted from the 
printed page, “translated” into a simple, unam- 
biguous language, and stored in a machine memory 
for use in an information retrieval system. Hope- 
fully, as pattern recognition procedures are de- 
veloped, the direct encoding and storing of stand- 
ardized charts or drawings will be possible. How 
non-standard charts and drawings could be encoded 
for such storage without pre-editing (i.e., remaking
them in standard form) is not apparent at this 
time. 

Furthermore, it is well understood that the trans- 
formation of complicated manual encoding proce- 
dures into a form presentable to a machine will 
entail elaborate programming. Theories for the
inductive inference programming of data processing 
equipment and their accompanying mathematics 
have been postulated.15 Other empirical heuristic
logics for the programming of data-processing machines
are presently under experiment.16 Perhaps
the key to mechanical encoding lies amid these 
nascent theories. The research efforts of the Pat- 
ent Office must persist in this, as well as in the 
linguistic phase of its general program so that when 
a mechanical search method has been realized, 
manpower will not be needlessly wasted in manually 
encoding and continuously updating the vast files. 